17 research outputs found

Statistical approaches to the study of protein folding and energetics

Author: Burkoff Nikolas S.

The determination of protein structure and the exploration of protein folding landscapes are two of the key problems in computational biology. In order to address these challenges, both a protein model that accurately captures the physics of interest and an efficient sampling algorithm are required. The first part of this thesis documents the continued development of CRANKITE, a coarse-grained protein model, and its energy landscape exploration using nested sampling, a Bayesian sampling algorithm. We extend CRANKITE and optimize its parameters using a maximum likelihood approach. The efficiency of our procedure, using the contrastive divergence approximation, allows a large training set to be used, producing a model which is transferable to proteins not included in the training set. We develop an empirical Bayes model for the prediction of protein β-contacts, which are required inputs for CRANKITE. Our approach couples the constraints and prior knowledge associated with β-contacts to a maximum entropy-based statistic which predicts evolutionarily related contacts. Nested sampling (NS) is a Bayesian algorithm shown to be efficient at sampling systems which exhibit a first-order phase transition. In this work we parallelize the algorithm and, for the first time, apply it to a biophysical system: small globular proteins modelled using CRANKITE. We generate energy landscape charts, which give a large-scale visualization of the protein folding landscape, and we compare the efficiency of NS to an alternative sampling technique, parallel tempering, when calculating the heat capacity of a short peptide. In the final part of the thesis we adapt the NS algorithm for use within a molecular dynamics framework and demonstrate the application of the algorithm by calculating the thermodynamics of all-atom models of a small peptide, comparing results to the standard replica exchange approach. This adaptation will allow NS to be used with more realistic force fields in the future.
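To make the nested sampling idea in this abstract concrete, here is a minimal sketch on a toy system. It is not the thesis code and does not use CRANKITE. The one-dimensional energy E(x) = x^2, the uniform prior on [-5, 5], and all function names are assumptions chosen for illustration. Because the toy energy is so simple, the constrained draw "sample uniformly with energy below the current threshold" can be done exactly; a real protein model would need a Monte Carlo or molecular dynamics walk for that step.

```python
import math
import random

HALF_WIDTH = 5.0  # assumed prior: x uniform on [-HALF_WIDTH, HALF_WIDTH]


def energy(x):
    """Toy energy function (an illustrative stand-in for a protein model)."""
    return x * x


def sample_below(e_max, rng):
    """Draw x uniformly from the prior, restricted to energy(x) <= e_max."""
    bound = min(HALF_WIDTH, math.sqrt(e_max))
    return rng.uniform(-bound, bound)


def nested_sampling(n_live=500, n_iter=7500, seed=0):
    """Return discarded energies and their estimated phase-space weights."""
    rng = random.Random(seed)
    live = [sample_below(math.inf, rng) for _ in range(n_live)]
    live_e = [energy(x) for x in live]
    shrink = n_live / (n_live + 1.0)  # expected volume contraction per step
    energies, weights = [], []
    x_prev = 1.0  # fraction of prior volume still enclosed
    for _ in range(n_iter):
        worst = max(range(n_live), key=live_e.__getitem__)
        x_curr = x_prev * shrink
        energies.append(live_e[worst])
        weights.append(x_prev - x_curr)
        # Replace the highest-energy point with a fresh one below its energy.
        live[worst] = sample_below(live_e[worst], rng)
        live_e[worst] = energy(live[worst])
        x_prev = x_curr
    return energies, weights


def partition_function(energies, weights, beta):
    """Estimate Z(beta) as a weighted sum over the discarded points."""
    return sum(w * math.exp(-beta * e) for e, w in zip(energies, weights))


def heat_capacity(energies, weights, beta):
    """Configurational heat capacity in units of k_B: beta^2 * Var(E)."""
    z = partition_function(energies, weights, beta)
    mean_e = sum(w * e * math.exp(-beta * e) for e, w in zip(energies, weights)) / z
    mean_e2 = sum(w * e * e * math.exp(-beta * e) for e, w in zip(energies, weights)) / z
    return beta ** 2 * (mean_e2 - mean_e ** 2)
```

A single run yields the energies and weights once, after which `partition_function` and `heat_capacity` can be evaluated at any inverse temperature `beta` without resampling. That reuse is the property that makes the method attractive for heat capacity curves.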

Warwick Research Archives Portal Repository

Improving protein-protein interaction prediction using evolutionary information from low-quality MSAs.

Author: Csilla Várnai
David L Wild
Nikolas S Burkoff
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2017

Evolutionary information stored in multiple sequence alignments (MSAs) has been used to identify the interaction interface of protein complexes, by measuring either co-conservation or co-mutation of amino acid residues across the interface. Recently, maximum entropy related correlated mutation measures (CMMs) such as direct information, decoupling direct from indirect interactions, have been developed to identify residue pairs interacting across the protein complex interface. These studies have focussed on carefully selected protein complexes with large, good-quality MSAs. In this work, we study protein complexes with a more typical MSA consisting of fewer than 400 sequences, using a set of 79 intramolecular protein complexes. Using a maximum entropy based CMM at the residue level, we develop an interface level CMM score to be used in re-ranking docking decoys. We demonstrate that our interface level CMM score compares favourably to the complementarity trace score, an evolutionary information-based score measuring co-conservation, when combined with the number of interface residues, a knowledge-based potential and the variability score of individual amino acid sites. We also demonstrate that, since co-mutation and co-complementarity in the MSA contain orthogonal information, the best prediction performance using evolutionary information can be achieved by combining the co-mutation information of the CMM with co-conservation information of a complementarity trace score, predicting a near-native structure as the top prediction for 41% of the dataset. The method presented is not restricted to small MSAs, and will likely also improve interface prediction for complexes with large and good-quality MSAs.

Crossref

University of Birmingham Research Portal

Directory of Open Access Journals

PubMed Central

Warwick Research Archives Portal Repository

Predicting protein β-sheet contacts using a maximum entropy-based correlated mutation measure

Author: Burkoff Nikolas S.
Várnai Csilla
Wild David L.
Publication venue: 'Oxford University Press (OUP)'

Motivation: The problem of ab initio protein folding is one of the most difficult in modern computational biology. The prediction of residue contacts within a protein provides a more tractable immediate step. Recently introduced maximum entropy-based correlated mutation measures (CMMs), such as direct information, have been successful in predicting residue contacts. However, most correlated mutation studies focus on proteins that have large good-quality multiple sequence alignments (MSAs) because the power of correlated mutation analysis falls as the size of the MSA decreases. Even with small autogenerated MSAs, though, maximum entropy-based CMMs contain information. To make use of this information, in this article, we focus not on general residue contacts but on contacts between residues in β-sheets. The strong constraints and prior knowledge associated with β-contacts are ideally suited for prediction using a method that incorporates an often noisy CMM. Results: Using contrastive divergence, a statistical machine learning technique, we have calculated a maximum entropy-based CMM. We have integrated this measure with a new probabilistic model for β-contact prediction, which is used to predict both residue- and strand-level contacts. Using our model on a standard non-redundant dataset, we significantly outperform a 2D recurrent neural network architecture, achieving a 5% improvement in true positives at the 5% false-positive rate at the residue level. At the strand level, our approach is competitive with the state-of-the-art single methods, achieving precision of 61.0% and recall of 55.4%, while not requiring residue solvent accessibility as an input.
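For readers less familiar with the metrics quoted above, the following sketch shows how precision and recall could be computed for a set of predicted strand-level contacts. The function name and the representation of a contact as a pair of strand indices are assumptions for illustration, not the paper's evaluation code.

```python
def precision_recall(predicted, actual):
    """Precision and recall for predicted contacts against true contacts.

    Both arguments are iterables of hashable contact identifiers, for
    example (strand_i, strand_j) tuples (a hypothetical representation).
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

Precision is the fraction of predicted contacts that are real; recall is the fraction of real contacts that were predicted.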

Warwick Research Archives Portal Repository

The effect of co-conservation and co-evolution on the interface prediction.

Author: Csilla Várnai (3720010)
David L. Wild (397570)
Nikolas S. Burkoff (1303929)

Left: The fraction of proteins for which a near-native decoy is in the top scored predictions, as a function of the number of decoys considered, for the S(SRP, SN, Sent) (grey solid line), S(SRP, SN, Sent, SCMM) (black solid line), S(SRP, SN, Sent, SCT) (grey dashed line), S(SRP, SN, Sent, SCT, SCMM) (black dashed line) and (light grey dash-dotted line) scoring functions. Right: The number of proteins for which the rank of the top near-native prediction is within the top 1, 5 or 10 predictions, for the S(SRP, SN, Sent) (solid black bars), S(SRP, SN, Sent, SCMM) (solid grey bars), S(SRP, SN, Sent, SCT) (dark checked bars) and S(SRP, SN, Sent, SCT, SCMM) (light checked bars) scoring functions.

FigShare

MSAs in the dataset.

Author: Csilla Várnai (3720010)
David L. Wild (397570)
Nikolas S. Burkoff (1303929)

Left: The cumulative distribution function of protein complexes in the dataset as a function of the number of sequences in their MSA. 95% of protein complexes have fewer than 400 sequences. Right: The effective number of sequences as a function of the number of amino acids in the protein complexes studied.

FigShare

Comparison of the interface-level scoring functions using CMM.

Author: Csilla Várnai (3720010)
David L. Wild (397570)
Nikolas S. Burkoff (1303929)

The fraction of proteins for which there is at least one near-native complex in the top predictions, for the scoring functions SCMM (black dash-dotted line), (light grey dash-dotted line), S(SRP, SN, Sent) (grey solid line), S(SRP, SN, Sent, SCMM) (black solid line) and (light grey solid line).

FigShare

Exploring the Energy Landscapes of Protein Folding Simulations with Bayesian Computation

Author: Csilla Varnai
David L. Wild
Nikolas S. Burkoff
Stephen A. Wells
Publication venue: 'Elsevier BV'

Crossref

Probability distribution of the residue-level CMM scores.

Author: Csilla Várnai (3720010)
David L. Wild (397570)
Nikolas S. Burkoff (1303929)

The distribution of the standardised Z(i, j) scores for all residues (solid line) and for the interface residues of the native structure (dashed line). Left: probability distribution function. Right: cumulative distribution function. The dash-dotted line shows 0, the mean of the standardised scores.

FigShare

Efficient Parameter Estimation of Generalizable Coarse-Grained Protein Force Fields Using Contrastive Divergence: A Maximum Likelihood Approach

Author: Csilla Várnai (1303926)
David L. Wild (397570)
Nikolas S. Burkoff (1303929)

Maximum Likelihood (ML) optimization schemes are widely used for parameter inference. They maximize the likelihood of some experimentally observed data iteratively with respect to the model parameters, following the gradient of the logarithm of the likelihood. Here, we employ an ML inference scheme to infer a generalizable, physics-based coarse-grained protein model (which includes Gō-like biasing terms to stabilize secondary structure elements in room-temperature simulations), using native conformations of a training set of proteins as the observed data. Contrastive divergence, a novel statistical machine learning technique, is used to efficiently approximate the direction of the gradient ascent, which enables the use of a large training set of proteins. Unlike previous work, the generalizability of the protein model allows the folding of peptides and a protein (protein G) which are not part of the training set. We compare the same force field with different van der Waals (vdW) potential forms: a hard cutoff model, and a Lennard-Jones (LJ) potential with vdW parameters inferred or adopted from the CHARMM or AMBER force fields. Simulations of peptides and protein G show that the LJ model with inferred parameters outperforms the hard cutoff potential, which is consistent with previous observations. Simulations using the LJ potential with inferred vdW parameters also outperform the protein models with adopted vdW parameter values, demonstrating that model parameters generally cannot be used with force fields with different energy functions. The software is available at https://sites.google.com/site/crankite/
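The gradient-following step described in this abstract can be written out in its standard textbook form for a Boltzmann-distributed model. The notation is an assumption and may differ from the paper's: E(x; θ) is the energy function with parameters θ, β the inverse temperature, η a learning rate, and ⟨·⟩_K an average over conformations reached after K Monte Carlo steps started from the training data.

```latex
% Boltzmann model and the exact gradient of the mean log-likelihood
P(x \mid \theta) = \frac{e^{-\beta E(x;\theta)}}{Z(\theta)},
\qquad
\frac{\partial}{\partial\theta}
\bigl\langle \ln P(x \mid \theta) \bigr\rangle_{\mathrm{data}}
= -\beta \left(
\left\langle \frac{\partial E}{\partial\theta} \right\rangle_{\mathrm{data}}
- \left\langle \frac{\partial E}{\partial\theta} \right\rangle_{\mathrm{model}}
\right)

% Contrastive divergence: replace the equilibrium model average by a
% short-run average started from the data, then take an ascent step
\theta \leftarrow \theta - \eta\,\beta \left(
\left\langle \frac{\partial E}{\partial\theta} \right\rangle_{\mathrm{data}}
- \left\langle \frac{\partial E}{\partial\theta} \right\rangle_{K}
\right)
```

The exact model average would require fully equilibrated sampling at every iteration. The short-run average avoids that cost, which is the source of the efficiency that the abstract says makes a large training set practical.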

FigShare